feat(schema): represent, serialize and validate v3 column default values (1/4)#746
feat(schema): represent, serialize and validate v3 column default values (1/4)#746huan233usc wants to merge 1 commit into
Conversation
445f7ae to
4aade00
Compare
| std::shared_ptr<const Literal> initial_default_; | ||
| std::shared_ptr<const Literal> write_default_; |
There was a problem hiding this comment.
ReassignField constructs a new SchemaField via the 5-argument constructor which initializes initial_default_ and write_default_ to nullptr. When schema IDs are reassigned (e.g., copying a schema with fresh IDs via the Schema(get_id) path), all default values on fields are silently lost. We should copy all field properties including initialDefault and writeDefault.
There was a problem hiding this comment.
Good catch, confirmed. Defaults are now constructor args, and ReassignField passes the source field's initial_default_ptr()/write_default_ptr() through, so they're shared with the reassigned field, not lost. Added ReassignIdsPreservesDefaultValues.
| if (initial_default_json.has_value()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(Literal literal, | ||
| LiteralFromJson(*initial_default_json, field.type().get())); | ||
| field = field.WithInitialDefault(std::move(literal)); | ||
| } | ||
| if (write_default_json.has_value()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(Literal literal, | ||
| LiteralFromJson(*write_default_json, field.type().get())); | ||
| field = field.WithWriteDefault(std::move(literal)); | ||
| } |
There was a problem hiding this comment.
The deserialization first constructs a bare SchemaField, then conditionally calls WithInitialDefault/WithWriteDefault, each of which copies the entire field (including the shared_ptr<Type>). This is an unnecessary intermediate copy.
There was a problem hiding this comment.
Agreed — FieldFromJson now parses the defaults first and builds the field in one construction. Intermediate copy gone.
| SchemaField SchemaField::WithInitialDefault(Literal initial_default) const { | ||
| SchemaField copy = *this; | ||
| copy.initial_default_ = std::make_shared<const Literal>(std::move(initial_default)); | ||
| return copy; | ||
| } |
There was a problem hiding this comment.
I don't think it's need to copy the whole SchemaField, can we just set the initial_default_ field and return *this.
Also the following With methods.
There was a problem hiding this comment.
Agreed — moved defaults into the constructor and removed both With... methods, so construction no longer copies the field.
1ee5b32 to
34470af
Compare
There was a problem hiding this comment.
Pull request overview
Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).
Changes:
- Extend
SchemaFieldto storeinitial-default/write-defaultas shared immutable literals and include them in equality/ID-reassignment rebuilds. - Add JSON serde for
initial-default/write-defaultusing existing single-value literal serialization. - Update schema projection to use
FieldProjection::Kind::kDefaultwhen an expected field is missing but hasinitial-default, and add/extend unit tests + v3 metadata fixture.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
First of a multi-part split of column default value support (apache#730) — the schema foundation the read and evolution paths build on. Purely additive; no read/write behavior change on its own. - SchemaField carries `initial-default` / `write-default` (immutable std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault / WithWriteDefault modifiers; getters return optional<reference_wrapper>. - JSON serde reads/writes `initial-default` / `write-default` via the existing single-value serialization. - Schema::Validate rejects default values below format v3 and validates they are non-null primitive literals matching the field type. - Generic schema projection maps a column missing from a data file with an initial-default to FieldProjection::Kind::kDefault. Read-path application (Parquet/Avro) and schema evolution follow in separate PRs. See apache#731 for the full end-to-end proof-of-concept.
34470af to
f663e0e
Compare
Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.
This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.
What's in this PR
SchemaFieldcarriesinitial-default/write-default, stored asstd::shared_ptr<const Literal>(immutable payload shared across copies, like theadjacent
type_; the C++ analog of Java'sfinal Literal<?>). They are set via theconstructor. Getters return
std::optional<std::reference_wrapper<const Literal>>forreading (the
Schema::FindFieldByNameidiom);initial_default_ptr()/write_default_ptr()expose the shared pointer so a rebuilt field (e.g. IDreassignment) shares the value instead of copying it.
initial-default/write-defaultusing the existingsingle-value serialization (all primitive types).
Schema::Validate: version-gates theinitial-defaultto format v3(
kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,so it requires the v3 reader contract. The
write-defaultonly affects values writtengoing forward and is not version-gated (matching Java's
Schema.checkCompatibility,which gates only the initial default). Both defaults are otherwise validated to be
non-null primitive literals matching the field type.
initial-defaultmaps to
FieldProjection::Kind::kDefaultcarrying the literal (the per-format readersconsume this in the follow-up PRs).
Follow-ups (stacked on this PR)
literal_util+ parquet projection/materialization)UpdateSchemaadd/update column defaults)Testing
Added tests